About the project

Welcome to my humble landing page and to the Introduction to Open Data Science course, together with me and the others!

Data science is growing in importance, and on this course we will learn related practical skills and, of course, have fun while doing it!

Here is a link to my GitHub repository


Analysis part of the R-exercise 2

This week I’ve worked with wrangling, tidying and interpreting questionnaire data and learned how to plot and test linear regression models. Of the variables included in the study, attitude seems to be the main factor affecting the test scores. The week has been a good refresher of previous statistics courses, and I have learned more about using R Markdown for presenting results from R.

Introduction to the data learning2014

The data in question is a survey of students’ attitudes towards statistics (SATS) and their approaches to learning (ASSIST), gathered 3.12.2014 - 10.1.2015. Students’ approaches to studying were divided into deep, strategic and surface categories. Here are the descriptions from a presentation by Kimmo Vehkalahti:

Surface approach: memorizing without understanding, with a serious lack of personal engagement in the learning process.
Deep approach: intention to maximize understanding, with a true commitment to learning.
Strategic approach: the ways students organize their studying (apply any strategy that maximizes the chance of achieving the highest possible grades).

The points variable states the points received in a statistics exam. Those who had 0 points (meaning that the participant did not attend the exam) were filtered out of the data.
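The filtering step can be sketched like this (a minimal sketch assuming dplyr; the actual wrangling was done in a separate data-wrangling script):

```r
# A minimal sketch of the filtering step (assuming dplyr)
library(dplyr)

# keep only participants who attended the exam (points > 0)
learning2014 <- filter(learning2014, points > 0)
```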

Summary of the data

learning2014 <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/learning2014.csv")
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

The data consists of seven variables and contains the survey responses of 166 persons, of which only 56 were men (perhaps a rather small sample size). There were approximately twice as many women as men among the participants. The participants’ ages ranged from 17 to 55 years.

Scatterplot matrix between the variables

library(GGally)
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

From the matrix we can see that the variables that correlate positively with exam points the most are attitude (0.437) and stra (0.146), while surf (-0.144) correlates negatively. Correlation measures how close two variables are to having a linear relationship with each other.

There is not much correlation between the study approaches; the largest is -0.324 between ‘deep’ and ‘surf’ (which is notable, but logical, as a student relying on the surface approach is unlikely to also use the deep approach).
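The pairwise correlations quoted above can also be computed directly with cor(); a quick sketch on the numeric columns:

```r
# correlation matrix of the numeric variables, rounded for readability
num_vars <- c("attitude", "deep", "stra", "surf", "points")
round(cor(learning2014[, num_vars]), 3)
```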

The distribution of age is heavily skewed to the right: most of the mass is at young ages, so a big portion of the participants were relatively young. The median age for men was a bit higher than for women.

The mean attitude score for men is somewhat higher than for women, and in general not many men scored low in attitude. There seems to be little difference in the use of deep study skills between the sexes, and women seem to rely more on strategic study skills.

We’ll choose the following linear model for regression \[points_i = \beta_1+\beta_2attitude_i+\beta_3stra_i+\beta_4surf_i+\epsilon_i\] where \(\beta_1\) is the intercept and \(\epsilon_i\) is the error term.

# create a regression model with multiple explanatory variables
my_model <- lm(points ~ attitude + stra + surf, data = learning2014)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

According to the multiple R-squared value, the predictors jointly explain only about 21% of the observed variance in the dependent variable. The F-statistic for the regression tells us that we can reject the null hypothesis that the coefficients of all our explanatory variables are jointly zero, so we don’t have to abandon the model at this point.

The positive coefficient of 3.40 for the variable attitude is statistically significant even at the 0.001 significance level.

The p-values for the variables stra and surf are higher than 0.05, which means we cannot, at the 95% confidence level, reject the null hypotheses that their coefficients are zero (ceteris paribus). Thus they are candidates for discarding.

Let’s test the joint null hypothesis that the coefficients of both stra and surf are zero, \(H_0:\beta_3=\beta_4=0\).

library(car)
ss_results<- linearHypothesis(my_model,  c("stra = 0", "surf = 0"))
print(ss_results)
## Linear hypothesis test
## 
## Hypothesis:
## stra = 0
## surf = 0
## 
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    164 4641.1                           
## 2    162 4544.4  2    96.743 1.7244 0.1815

We get a p-value of 0.182 (>0.05), which means we cannot reject the joint null hypothesis at the 95% confidence level.

Let’s now drop these two ‘redundant’ variables. The new regression model is \[points_i = \beta_1 +\beta_2attitude_i+\epsilon_i\]

# create a regression model without redundant variables
new_model <- lm(points ~ attitude, data = learning2014)

# print out a summary of the model
summary(new_model)
## 
## Call:
## lm(formula = points ~ attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The multiple \(R^2\) is now almost the same as earlier, even though we dropped two variables. This is a good sign, because adding variables makes \(R^2\) grow mechanically; since dropping them barely changed it (and the adjusted \(R^2\) is essentially unchanged), the discarded variables added little explanatory power. The residual standard error is also practically unchanged (5.32 vs. 5.296). The remaining explanatory variable explains ~19% of the variance in the exam points.
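The restricted and full models can also be compared with anova(), which runs an F-test for dropping stra and surf and should match the linearHypothesis result above:

```r
# F-test of new_model (restricted) against my_model (full);
# this is the same comparison as the linearHypothesis test above
anova(new_model, my_model)
```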

Evaluating the assumptions of the model

The assumptions made:
1. the errors are normally distributed
2. the errors are not correlated
3. the errors have constant variance (homoscedasticity)

\[\epsilon\sim N(0,\sigma^2) \]

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
par(mfrow = c(2,2))
plot(new_model, which = c(1,2,5))

Assessing assumptions visually

Looking at the Q-Q-plot, the assumption of normality of the residuals seems reasonable, as most of the points are located on the line, even if some outliers can be seen at the edges.

No obvious pattern emerges in the plotting of residuals vs fitted values (variance seems constant) and none of the observations have big leverage over the whole regression.

Visual inspection implies that these assumptions are met. It’s however a bit hard to say for sure that the variance is constant, so few tests might be in place to confirm this.

Testing whether the variance is constant

We can test for multiplicative heteroscedasticity using the Breusch-Pagan test, testing against the \(H_0\) of no heteroscedasticity.

library(lmtest)
library(dplyr)
new_model %>% 
  lmtest::bptest() -> bpresults
print(bpresults)
## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 0.0039163, df = 1, p-value = 0.9501

We get a p-value of 0.95, so we cannot reject the null hypothesis - we’ve found no evidence against homoscedasticity.
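For intuition, the studentized Breusch-Pagan statistic can be sketched by hand: regress the squared residuals on the explanatory variable and use n·R², which is asymptotically chi-squared with df equal to the number of regressors (here 1). This is a sketch of the test’s logic, not a replacement for lmtest::bptest().

```r
# by-hand sketch of the studentized Breusch-Pagan test for new_model
u2  <- residuals(new_model)^2               # squared residuals
aux <- lm(u2 ~ learning2014$attitude)       # auxiliary regression
bp  <- length(u2) * summary(aux)$r.squared  # n * R^2
pchisq(bp, df = 1, lower.tail = FALSE)      # p-value
```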

The White test is a generalization of the Breusch-Pagan test and may detect more general forms of heteroscedasticity.

whiteresults <- bptest(new_model, ~ attitude + I(attitude^2), data = learning2014)
print(whiteresults)
## 
##  studentized Breusch-Pagan test
## 
## data:  new_model
## BP = 0.088873, df = 2, p-value = 0.9565

With the White test we get a p-value of 0.957, which means that we have not found any evidence against the assumptions of our linear regression model.


Analysis part of the R-exercise 3

1 & 2 Preparing the data

The data come from two identical questionnaires related to secondary school student alcohol consumption in Portugal. The data attributes include student grades and demographic, social and school-related features, and the data were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). You can find more information here.

The following adjustments have been made:
1. The variables not used for joining the two datasets have been combined by averaging (including the grade variables)
2. alc_use is the average of Dalc and Walc
3. high_use is TRUE if alc_use is higher than 2 and FALSE otherwise
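Steps 2 and 3 above can be sketched as follows (a hedged sketch, assuming dplyr and the joined data frame from the wrangling script):

```r
library(dplyr)

# alc_use: average of weekday (Dalc) and weekend (Walc) consumption;
# high_use: TRUE when that average exceeds 2
alc <- mutate(alc,
              alc_use  = (Dalc + Walc) / 2,
              high_use = alc_use > 2)
```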

Let’s take a look at the names of the variables in our dataset:

alc <- read.csv("https://raw.githubusercontent.com/BreezewindX/IODS-project/master/data/alc.csv")

colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

3 Choosing the variables

I’ll choose high_use as dependent variable and failures, absences and sex as explanatory variables.

My intuition is that
* a high use of alcohol is likely related to absences and also failures.
* drinking might lead to failing and failing might induce even more drinking
* absences might be indicator of failures by itself, even without high alcohol consumption, as there can be multiple reasons for these absences.
* alcohol consumption habits might differ between men and women.

The logistic regression model is \[logit(p_i)=log(p_i/(1-p_i))=\beta_1+\beta_2failures_i+\beta_3absences_i+\beta_4sex_i\] where \(p_i\) is the probability that high_use is true for student \(i\).

4 Plotting the data

library(ggplot2)
ggplot(alc, aes(x = high_use, fill = sex)) + 
  geom_bar(position = "fill")

The majority of the students with high alcohol consumption were men.

library(ggpubr)
library(dplyr)
dens_score<- ggplot(alc, aes(x = G3)) +
geom_density()
x <- seq(1, 18, length.out=100)
df <- with(alc, data.frame(x = x, y = dnorm(x, mean(G3), sd(G3))))
dens_score2<-dens_score + geom_line(data = df, aes(x = x, y = y), color = "red")

dens_fail <- ggplot(alc, aes(x = failures, fill = high_use)) +
geom_bar(position="fill")

dens_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_bar(position = "fill")

dens2_abs <- ggplot(alc, aes(x = absences, fill = high_use)) +
geom_histogram()

#  facet_wrap(~ high_use)
ggarrange(dens_score2, dens_fail, dens_abs, dens2_abs, 
          labels = c("A", "B", "C", "D"),
          ncol = 2, nrow = 2)

From A we can see that the grades are roughly, even if not perfectly, normally distributed; the red curve is a normal density with the sample mean and standard deviation, drawn for reference.

B shows, for each number of failures, the shares of students with high and low alcohol consumption. We can see that the share of students with high alcohol use grows as the number of failures grows. The findings support the hypothesis that high alcohol consumption raises the risk of failure.

Also, in C the share of students with high alcohol consumption grows as the number of absences grows. This supports the view that students with high alcohol consumption have more absences than students with low alcohol consumption.

D shows the same data as C in absolute counts, where we can see that the number of absences tails off quickly. There are some outliers in the data.

library(ggplot2)

# initialize a plot of high_use and G3
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))

# define the plot as a boxplot and draw it
g1 + geom_boxplot() + ylab("grade")

The distribution of grades seems more symmetric among students with low alcohol consumption, and the grades of both sexes have more spread there than in the high-consumption case. Men’s grades are clearly more affected by heavy consumption than women’s, as the median grade for women is almost the same in both cases. For men there are outliers who received bad grades regardless of how much they drink.

# initialise a plot of high_use and absences
g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))

# define the plot as a boxplot and draw it
g2 + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

Absences seem to increase when alcohol consumption is high, which makes sense, as heavy consumption has a tendency to make you feel like crap the next morning. Again, we can see that men’s absences are more affected by drinking. Women have more absences than men when consuming little, and the outliers in the women’s data have a fairly high number of absences.

library(GGally)
vars <- c("high_use","failures","absences","sex")
ggpairs(alc, columns = vars, mapping = aes(col = sex, alpha = 0.3), lower = list(combo = wrap("facethist")))

Last but not least, we can see that there is not much correlation between failures and absences. There were roughly equal numbers of both sexes in the data, even if there were slightly more women.

5 Interpreting the logistic regression

# find the model with glm()
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1855  -0.8371  -0.6000   1.1020   2.0209  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.90297    0.22626  -8.411  < 2e-16 ***
## failures     0.45082    0.18992   2.374 0.017611 *  
## absences     0.09322    0.02295   4.063 4.85e-05 ***
## sexM         0.94117    0.24200   3.889 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 378  degrees of freedom
## AIC: 432.4
## 
## Number of Fisher Scoring iterations: 4
# Round the coefficients for a cleaner look
clean <- round(coef(m), digits=3)
print(clean)
## (Intercept)    failures    absences        sexM 
##      -1.903       0.451       0.093       0.941
mean(m$residuals)
## [1] 0.02143222

Here we can see the summary of the model. All explanatory variables are statistically significant at the 95% confidence level; absences and sex are significant even at 99.9%.

For every one-unit increase in failures, the log odds of high_use (versus low use) increase by 0.451.
For every one-unit increase in absences, the log odds of high_use (versus low use) increase by 0.093.
If the person in question is male, the log odds of high_use increase by 0.941.

Therefore this fitted model says that, holding failures and absences at a fixed value, the odds of high alcohol consumption for a male (sexM = 1) over the odds of high alcohol consumption for a female (sexM = 0) are exp(0.941166) ≈ 2.563. In terms of percent change, we can say that the odds for men are about 156% higher than the odds for women. The coefficient for failures says that, holding absences and sex at a fixed value, the odds of high alcohol consumption increase by about 57% for a one-unit increase in failures, since exp(0.4508198) ≈ 1.57. The variable absences has the same interpretation.

Residuals of the regression are close to mean zero, which is what we would want.
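To make the log-odds scale concrete, here is the fitted probability for a made-up example student (male, 2 failures, 5 absences), using the rounded coefficients printed above:

```r
# hypothetical student: male, with 2 failures and 5 absences
logit <- -1.903 + 0.451 * 2 + 0.093 * 5 + 0.941
plogis(logit)  # inverse logit: exp(logit) / (1 + exp(logit)) ≈ 0.60
```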

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp

# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %   97.5 %
## (Intercept) 0.1491257 0.09395441 0.228611
## failures    1.5695984 1.08339644 2.294737
## absences    1.0977032 1.05169654 1.150848
## sexM        2.5629682 1.60381392 4.149405

Here we can see the odds ratios for the variables and their corresponding confidence intervals. Note that for logistic models, confidence intervals are based on the profiled log-likelihood function.

Following is an excerpt from here.

An odds ratio measures the association between a predictor variable (x) and the outcome variable (y). It represents the ratio of the odds that an event will occur (event = 1) given the presence of the predictor x (x = 1), compared to the odds of the event occurring in the absence of that predictor (x = 0). For a given predictor (say x1), the associated beta coefficient (b1) in the logistic regression function corresponds to the log of the odds ratio for that predictor. If the odds ratio is 2, then the odds that the event occurs (event = 1) are two times higher when the predictor x is present (x = 1) versus x is absent (x = 0).

From the odds ratios we have here, we can say that a student is more likely to have high alcohol consumption if failures are present, and if the student is male the odds are over 2.5 times as high as if they were female.
Note that the odds ratio is the exponential of the regression coefficient, not its square.

The confidence interval can be interpreted so that intervals constructed this way contain the true value of the coefficient 95% of the time. The confint function shows two-tailed intervals, from the 2.5% point to the 97.5% point of the relevant distribution, which form the lower and upper limits.
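For comparison, plain Wald intervals (estimate ± 1.96 × standard error on the log-odds scale) can be obtained with confint.default(); with this much data they should be close to the profiled intervals above:

```r
# Wald confidence intervals, exponentiated to the odds-ratio scale
exp(confint.default(m))
```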

Let’s recap my intuition from the beginning of this analysis part:

  • a high use of alcohol is likely related to absences and also failures.

Failures and high use of alcohol do seem to be related; we can see this from the statistically significant regression coefficient.

  • absences might be indicator of failures by itself, even without high alcohol consumption, as there can be multiple reasons for these absences.

Absences are also statistically significant, but each absence raises the odds only a little. There are probably many more reasons for being absent than alcohol intake.

  • alcohol consumption habits might differ between men and women.

The data analysis would seem to support the view that men in the study group are more likely to have high alcohol consumption, and that with high consumption they also get lower grades and are absent more often.

6 Exploring the predictive power of the chosen model

Target variable versus the predictions (2x2) tabulation

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   259    9
##    TRUE     84   30

We can see that the model predicted quite accurately when high_use was false, as only 9 predictions out of 268 were wrong.

However, it did not do so well when high_use was true: the prediction was wrong 84 times out of 114. A graphical representation follows.

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67801047 0.02356021 0.70157068
##    TRUE  0.21989529 0.07853403 0.29842932
##    Sum   0.89790576 0.10209424 1.00000000

Computing the total proportion of inaccurately classified individuals (= the training error):

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss <- loss_func(class = alc$high_use, prob = alc$probability)
print(loss)
## [1] 0.2434555

On average, our model misclassifies approximately 24% of the observations in the training data.
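As a cross-check, the same training error can be read directly off the 2x2 table above as the off-diagonal share:

```r
# wrong predictions (9 false positives + 84 false negatives) over n = 382
(9 + 84) / (259 + 9 + 84 + 30)  # ≈ 0.243, matching loss_func
```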

7 10-fold cross-validation on the model

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801

The average proportion of wrong predictions in the cross-validation is approximately 0.26, slightly higher than the training error, as expected. My chosen model is exactly the same as in the DataCamp exercise, so of course the prediction error is the same. When comparing different models, we would prefer the model with the smaller penalty (i.e. error). If one could re-select the explanatory variables so that the error were smaller, that model could be preferred for prediction over this one. Adding more variables to the regression lowers the training error mechanically, but it can increase the cross-validation error through overfitting.


Analysis part of the R-exercise 4

1 Loading the inbuilt dataset Boston

library(MASS)
# load the data
data("Boston")

2 A brief overlook over the data

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

Boston dataframe has 506 observations of 14 variables:

  • crim - per capita crime rate by town.
  • zn - proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus - proportion of non-retail business acres per town.
  • chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox - nitrogen oxides concentration (parts per 10 million).
  • rm - average number of rooms per dwelling.
  • age - proportion of owner-occupied units built prior to 1940.
  • dis - weighted mean of distances to five Boston employment centres.
  • rad- index of accessibility to radial highways.
  • tax - full-value property-tax rate per $10,000.
  • ptratio - pupil-teacher ratio by town.
  • black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat - lower status of the population (percent).
  • medv - median value of owner-occupied homes in $1000s.

The information about the dataset can be found here.

3 Plotting the data

# plot matrix of the variables
pairs(Boston)

The pairs graph is a mess so far, so let’s see if we can discard some variables and then maybe try drawing it again…

Data correlations

Let’s draw the correlations between the variables.

library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits=2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)

We can see the most correlated variable pairs (with |correlation| > 0.7) by melting the upper triangle of the correlation matrix:

library(reshape2)
CM <- cor_matrix                           # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA       # lower tri and diag set to NA
subset(melt(CM, na.rm = TRUE), abs(value) > .7)
##      Var1 Var2 value
## 59  indus  nox  0.76
## 89    nox  age  0.73
## 101 indus  dis -0.71
## 103   nox  dis -0.77
## 105   age  dis -0.75
## 129 indus  tax  0.72
## 135   rad  tax  0.91
## 195 lstat medv -0.74

Variables rad and tax are highly positively correlated (0.91). This can be interpreted so that the areas with better access to the city’s radial highways also have a higher property-tax rate.

Next are nox and dis with a negative correlation of -0.77. The interpretation is that the smaller the weighted average distance to the five Boston employment centres, the higher the nitrogen oxides concentration (worse air pollution).

Third, with a positive correlation of 0.76, are indus and nox. The interpretation is that the higher the proportion of non-retail business acres in a town, the higher the concentration of nitrogen oxides.

(One thing that can be noted is that the status of the inhabitants and the number of rooms correlate strongly with the housing value.)

For convenience, let’s plot histograms only for these variables of interest.

Distributions of the chosen variables

library(ggpubr)
library(ggplot2)

rad_plot<-ggplot(Boston, aes(x=rad)) +
geom_histogram()

tax_plot <- ggplot(Boston, aes(x=tax)) +
geom_histogram()

nox_plot <- ggplot(Boston, aes(x=nox)) +
geom_histogram()

dis_plot <- ggplot(Boston, aes(x=dis)) +
geom_histogram()

indus_plot <- ggplot(Boston, aes(x=indus)) +
geom_histogram()

ggarrange(rad_plot, tax_plot, nox_plot, dis_plot, indus_plot, 
          labels = c("A", "B", "C", "D", "E"),
          ncol = 3, nrow = 2)

Plot A: rad is the index of accessibility to radial highways. There are two concentrations in the graph: the first around index 4-5 and the second at index 24. There are 132 observations in the latter (index value 24), which is a large share of all observations. It would make sense to check a map of Boston to better understand these outliers.

Plot B: tax shows the full-value property-tax rate, and the distribution is quite similar to plot A, i.e. a concentration at the lower values and a large number of outliers at the top (132 observations with the value 666).

Plots C and D (nox and dis) also have similarly right-skewed distributions.

Plot E: indus is skewed to the right and has a high count (132 observations) at a single value (18.1).

There seem to be 132 observations in the data that are outliers for some reason.
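We can check whether the recurring count of 132 really is the same block of towns by counting the flagged values directly:

```r
library(MASS)
data("Boston")

sum(Boston$rad == 24)   # towns with the highest accessibility index
sum(Boston$tax == 666)  # towns with the top property-tax rate
# cross-tabulate to see whether the two groups coincide
table(rad24 = Boston$rad == 24, tax666 = Boston$tax == 666)
```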

4 Standardising the dataset

boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865

The scaled dataset is of type matrix, so we’ll convert it into a data frame. In the new data frame all the variables have zero mean and unit variance.

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
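As a quick sanity check, scale() standardizes each column as (x - mean(x)) / sd(x):

```r
# manual standardization of a toy vector matches scale()
x <- c(1, 2, 3, 4)
manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), manual)  # TRUE
```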

Linear discriminant analysis produces results based on the assumptions that

  • the variables are normally distributed (conditional on the classes),
  • the normal distributions for each class share the same covariance matrix, and
  • the variables are continuous

Next we create a categorical variable of the crime rate (from the scaled crime rate), using the quantiles as the break points, drop the old crime rate variable from the dataset, and divide the dataset into train and test sets so that 80% of the data belongs to the train set.

bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label=c("low","med_low","med_high","high"))

# look at the table of the new factor crime
#table(crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

5 Fitting and plotting the linear discriminant analysis on the train set

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
#lda.fit

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2,col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

From the graph we can see which variables imply high crime; the most important factor is rad, whose arrow points towards the high-crime cluster.

6 Predict the classes with the LDA model on the test data

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12      10        2    0
##   med_low    3      15        2    0
##   med_high   0      15       18    2
##   high       0       0        1   22

The model predicts high crime very well (all are correct), and looking at the predictions as a whole, it does better on the higher-rate classes. For every class, the model predicted correctly at least as often as incorrectly.
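The overall accuracy can be computed from the cross-tabulation. A sketch using the values printed above (note that the exact numbers depend on the random train/test split):

```r
# the confusion matrix from the printout above, entered by hand
cm <- matrix(c(12, 10,  2,  0,
                3, 15,  2,  0,
                0, 15, 18,  2,
                0,  0,  1, 22),
             nrow = 4, byrow = TRUE,
             dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                             predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correct predictions on the diagonal / all predictions
sum(diag(cm)) / sum(cm)
```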

7 Standardising the original dataset and calculating distances

First we calculate the Euclidean distance matrix, and then the Manhattan distance matrix.

library(MASS)        # the Boston data lives in MASS
library(factoextra)  # for get_dist()
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# euclidean distance matrix
dist_eu <- get_dist(boston_scaled)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled, method="manhattan")

# look at the summary of the distances
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
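To recall what the two metrics measure, a purely illustrative toy sketch with two points (not the Boston data):

```r
# Euclidean vs. Manhattan distance between (0, 0) and (3, 4)
p <- c(0, 0)
q <- c(3, 4)

sqrt(sum((p - q)^2))   # Euclidean: straight-line distance, 5
sum(abs(p - q))        # Manhattan: city-block distance, 7

# dist() computes the same on a data matrix
dist(rbind(p, q), method = "euclidean")
dist(rbind(p, q), method = "manhattan")
```

Manhattan distances are never smaller than Euclidean ones, which is also visible in the two summaries above.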

Calculating the K-means

############
# K-means  #
############
k_max <- 10
set.seed(123)

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results

qplot(x = 1:k_max, y = twcss, geom = 'line')

From the graph we choose two centers, as that is the point where the slope of the line changes from steep to level (the elbow method). Let's run K-means with two centers.
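The elbow can also be read off numerically from the drops in the TWCSS. A sketch with made-up values for k = 1..5 (hypothetical, not the values from the plot):

```r
# hypothetical total within-cluster sums of squares for k = 1..5
twcss_toy <- c(1000, 400, 350, 330, 320)

# consecutive drops: the fall from k = 1 to k = 2 dwarfs the rest,
# so the elbow is at k = 2
diff(twcss_toy)
```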

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

    
library(tidyverse)
library(data.table)
boston_scaled %>%
  mutate(Cluster = km$cluster) %>%
  group_by(Cluster) %>%
  summarise_all("mean") %>%
DT::datatable(extensions = 'FixedColumns', options = list(dom = 't',
  scrollX = TRUE,
  scrollCollapse = TRUE))

You can scroll the datatable to see the rest of the variables, summarised by their means within each cluster. There are clear differences between the two clusters: cluster #2 has higher crime, a bigger proportion of non-retail business acres, more pollution, and older buildings; it is further from the employment centres, has fewer teachers per pupil, and the median value of owner-occupied homes is lower.

Let's draw a pairs plot coloured by the two clusters:

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled, col = km$cluster)

Red is cluster #2 and black is the nicer-neighbourhood cluster #1.

We can see from the graph what we saw earlier numerically in the datatable. The top row plots crime against the other variables, and cluster #2 has higher crime rates across all of them.

# k-means clustering
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[1:5], col = km$cluster)

Comparing crim and indus, the red cluster has higher crime and is more industrialised than the black one. The same holds for nox, and it is logical that more industry also means more air pollution. zn (the proportion of residential land zoned for large lots) suggests the red cluster has little large-lot residential zoning, unlike the black cluster.

We could go on analysing all the variables in the graph, but I'm certain that's not the point of this exercise.

Bonus

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)
k3 <- kmeans(boston_scaled, centers = 3)
k4 <- kmeans(boston_scaled, centers = 4)
k5 <- kmeans(boston_scaled, centers = 5)
k6 <- kmeans(boston_scaled, centers = 6)
k7 <- kmeans(boston_scaled, centers = 7)

# plots to compare
p2 <- fviz_cluster(k3, geom = "point",  data = boston_scaled) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = boston_scaled) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = boston_scaled) + ggtitle("k = 5")
p5 <- fviz_cluster(k6, geom = "point",  data = boston_scaled) + ggtitle("k = 6")
p6 <- fviz_cluster(k7, geom = "point",  data = boston_scaled) + ggtitle("k = 7")

library(gridExtra)
grid.arrange(p2, p3, p4, p5, p6, nrow = 3)

I choose k = 3, as in my opinion it gives the best separation.

set.seed(123)
km <- boston_scaled %>%
  kmeans(centers = 3)
lda.fit <- lda(km$cluster ~ ., data = boston_scaled)

classes <- as.numeric(km$cluster)
plot(lda.fit, dimen = 2, col = classes)
lda.arrows(lda.fit, myscale = 4)

rad and tax (and perhaps indus) are the most influential linear separators for cluster #2, age and chas for cluster #1, and dis and rm for cluster #3. The black variable points towards clusters #1 and #3. Selecting the right number of clusters seems to be important, so it's useful to know some methods for choosing the optimal number (for example the elbow method we used earlier).

Super Bonus

data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
set.seed(123)  #Setting seed
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
#table(crime)

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]
lda.fit <- lda(crime ~ ., data = train)
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
set.seed(123)
data("Boston")
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
myset <- boston_scaled[ind,]

km <- kmeans(myset, centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)

We can see that one cluster roughly covers the high and med-high crime dots, and the other cluster the lower-crime dots. There are no other differences between the two graphs, because I set the seed and the data are the same for both. I'm using the optimal two centers for the clusters here.


R-exercise 5

Loading and exploring the data

library(readr)
human <- read.csv("~/Documents/GitHub/IODS-project/data/human.csv", row.names=1)
str(human)
## 'data.frame':    155 obs. of  9 variables:
##  $ country     : Factor w/ 155 levels "Afghanistan",..: 105 6 134 41 101 54 67 149 27 102 ...
##  $ edu2F       : num  97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
##  $ labF        : num  61.2 58.8 61.8 58.7 58.5 53.6 53.1 56.3 61.6 62 ...
##  $ lifeExp     : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ expSchool   : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ gni         : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mamo.ratio  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ adobirthrate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlrep     : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##         country        edu2F             labF          lifeExp     
##  Afghanistan:  1   Min.   :  0.90   Min.   :13.50   Min.   :49.00  
##  Albania    :  1   1st Qu.: 27.15   1st Qu.:44.45   1st Qu.:66.30  
##  Algeria    :  1   Median : 56.60   Median :53.50   Median :74.20  
##  Argentina  :  1   Mean   : 55.37   Mean   :52.52   Mean   :71.65  
##  Armenia    :  1   3rd Qu.: 85.15   3rd Qu.:61.90   3rd Qu.:77.25  
##  Australia  :  1   Max.   :100.00   Max.   :88.10   Max.   :83.50  
##  (Other)    :149                                                   
##    expSchool          gni           mamo.ratio      adobirthrate   
##  Min.   : 5.40   Min.   :   581   Min.   :   1.0   Min.   :  0.60  
##  1st Qu.:11.25   1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65  
##  Median :13.50   Median : 12040   Median :  49.0   Median : 33.60  
##  Mean   :13.18   Mean   : 17628   Mean   : 149.1   Mean   : 47.16  
##  3rd Qu.:15.20   3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95  
##  Max.   :20.20   Max.   :123124   Max.   :1100.0   Max.   :204.80  
##                                                                    
##     parlrep     
##  Min.   : 0.00  
##  1st Qu.:12.40  
##  Median :19.30  
##  Mean   :20.91  
##  3rd Qu.:27.95  
##  Max.   :57.50  
## 

The human data frame has 155 observations of 9 variables:

  • country - country name
  • edu2F - proportion of females with at least secondary education
  • labF - proportion of females in the labour force
  • lifeExp - life expectancy at birth
  • expSchool - expected years of schooling
  • gni - gross national income per capita
  • mamo.ratio - maternal mortality ratio
  • parlrep - percentage of female representatives in parliament
  • adobirthrate - adolescent birth rate.

Of these, edu2F and labF are combined proportion variables.

The dataset combines the Human Development Index and the Gender Inequality Index, which are part of the United Nations Development Programme's Human Development Reports.

More information about the dataset can be found here.

Technical notes can be found here.

This time there is no binary variable in the data. For some variables the mean and the median differ, so their distributions look skewed. We will see this better when we plot the variables next.
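As a small illustration of the mean-versus-median heuristic (on simulated data, not the human data):

```r
# a right-skewed distribution pulls the mean above the median,
# much like the shapes of gni and mamo.ratio in the summary above
set.seed(1)
skewed <- rexp(1000)
mean(skewed) > median(skewed)   # TRUE for right skew
```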

Plotting the data

Let’s start with histograms and density plots, followed by pairs and correlation graphs.

library(purrr)
library(tidyr)
library(ggplot2)

human %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()

human %>%
  keep(is.numeric) %>%                     # Keep only numeric columns
  gather() %>%                             # Convert to key-value pairs
  ggplot(aes(value)) +                     # Plot the values
    facet_wrap(~ key, scales = "free") +   # In separate panels
    geom_density()

Some of the variables, like expSchool, labF and parlrep, look like they might be normally distributed. edu2F might be trimodal. adobirthrate, gni and mamo.ratio are skewed to the right, and lifeExp to the left.

Next, I'll create a categorical variable based on the quartiles of the GNI index, to see better how the variables differ by their GNI score.

library(GGally)

# create a categorical variable 'gni'
bins <- quantile(human$gni)
gnicat <- cut(human$gni, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

ggpairs(dplyr::select(human, -country), mapping = aes(col = gnicat, alpha = 0.3), lower = list(combo = wrap("facethist")))

The order of the coloring is red, green, blue, purple from low to high scores in the GNI index.

library("tidyverse")
library("dplyr")
library("corrplot")
# calculate the correlation matrix and round it
cor_matrix <- cor(dplyr::select(human, -country)) %>% round(digits = 2)
# print the correlation matrix
#print(cor_matrix)
# visualize the correlation matrix
corrplot.mixed(cor_matrix, tl.cex = 0.7, number.cex = .8)

Countries in the higher GNI quartiles have higher education shares for women than the lower ones. They also have higher expected years of schooling and life expectancy. The labour participation rate for women is on the higher end in the quartile that scores lowest in GNI. There is surprisingly little difference between the four GNI quartiles in women's parliamentary participation rate (assuming these countries have parliaments). Only the fourth quartile has something of a normal distribution in it; the others are skewed to the right.

Let's sum up the biggest correlations between the variables (arranged in decreasing order of absolute value).

library(reshape2)
CM <- cor_matrix                           # Make a copy of the matrix
CM[lower.tri(CM, diag = TRUE)] <- NA       # lower tri and diag set to NA
corr_list<-subset(melt(CM, na.rm = TRUE), abs(value) > .7)
corr_list %>% arrange(desc(abs(value))) -> corr_list2
print(corr_list2)
##         Var1         Var2 value
## 1    lifeExp   mamo.ratio -0.86
## 2    lifeExp    expSchool  0.79
## 3      edu2F    expSchool  0.78
## 4 mamo.ratio adobirthrate  0.76
## 5  expSchool   mamo.ratio -0.74
## 6    lifeExp adobirthrate -0.73
## 7      edu2F      lifeExp  0.72
## 8      edu2F   mamo.ratio -0.72

Life expectancy has a strong negative correlation (-0.86) with the maternal mortality ratio. This one is pretty self-explanatory.

Life expectancy has a strong positive correlation with the expected years of schooling (0.79), which is logical: when life expectancy is higher, parents value schooling more as an investment.

The proportion of females with at least secondary education correlates positively (0.78) with the expected years of schooling (of course, as a higher share of the population attends school, the population average rises too).

Next is the positive correlation (0.76) between the maternal mortality ratio and the adolescent birth rate, which again makes sense: the higher the adolescent birth rate, the more maternal mortality there tends to be, since young mothers face higher risks in childbirth.

As a last remark, the education share for women seems to go together with higher life expectancy and a smaller maternal mortality ratio.

2 Principal component analysis (PCA)

First we perform the PCA on the non-standardised data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(dplyr::select(human, -country))

print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 18544.172057   186.283543    25.972416    20.074641    14.321634
## [6]    10.634338     3.721257     1.428190
## 
## Rotation (n x k) = (8 x 8):
##                        PC1          PC2           PC3           PC4
## edu2F        -9.317458e-04  0.085169068 -0.3652346210 -0.8435797499
## labF          9.124960e-05 -0.031442122 -0.0180958993 -0.3928157700
## lifeExp      -2.815825e-04  0.028224407 -0.0170864406 -0.0212996883
## expSchool    -9.562916e-05  0.007556395 -0.0210421918 -0.0308785262
## gni          -9.999828e-01 -0.005830843  0.0006412388  0.0002381021
## mamo.ratio    5.655739e-03 -0.987500960 -0.1481355205 -0.0173448186
## adobirthrate  1.233962e-03 -0.125305410  0.9184673154 -0.3475453954
## parlrep      -5.526462e-05  0.003211742 -0.0038388487 -0.1075784134
##                        PC5           PC6           PC7           PC8
## edu2F        -3.770012e-01 -6.083913e-02  0.0303039622  3.118655e-02
## labF          8.233860e-01  4.077784e-01 -0.0093667016  6.654689e-03
## lifeExp       1.136966e-02 -6.669649e-02 -0.9901494274  1.161211e-01
## expSchool     3.896982e-03 -2.437473e-02 -0.1134676761 -9.925031e-01
## gni           6.651486e-06  2.062449e-05  0.0001028639  1.871381e-05
## mamo.ratio   -3.955532e-02 -2.010447e-02 -0.0244043639 -7.101321e-04
## adobirthrate -1.381061e-01 -2.499459e-02 -0.0127951595 -8.079546e-03
## parlrep       3.989026e-01 -9.077136e-01  0.0704544742  1.925677e-02
# draw a biplot of the principal component representation and the original variables

s <- summary(pca_human)
print(s)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6   PC7
## Standard deviation     1.854e+04 186.2835 25.97 20.07 14.32 10.63 3.721
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00  0.00  0.00 0.000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00  1.00  1.00 1.000
##                          PC8
## Standard deviation     1.428
## Proportion of Variance 0.000
## Cumulative Proportion  1.000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

Let's check out the percentages of variance of the principal components and make the biplot.

# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot1<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

Without standardising the data, one principal component explains practically all of the variation (it is dominated by gni, which has by far the largest variance), and the graph does not tell us much.
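A toy demonstration of why this happens: when one variable has a much larger variance than the rest, it dominates the unscaled PCA.

```r
# two uncorrelated variables, one with a huge standard deviation
set.seed(1)
toy <- data.frame(a = rnorm(100), b = rnorm(100, sd = 1000))

# unscaled: PC1 captures essentially all of the variance
summary(prcomp(toy))$importance["Proportion of Variance", 1]

# scaled: the variance is shared between the components
summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", 1]
```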

3 PCA with standardised data

Let’s now standardise the human data.

human_std <- scale(dplyr::select(human, -country))
# summaries of the scaled variables
summary(human_std)
##      edu2F               labF             lifeExp          expSchool      
##  Min.   :-1.76122   Min.   :-2.43186   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.91250   1st Qu.:-0.50289   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.03967   Median : 0.06116   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.96275   3rd Qu.: 0.58469   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 1.44288   Max.   : 2.21762   Max.   : 1.4218   Max.   : 2.4730  
##       gni            mamo.ratio       adobirthrate        parlrep       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# change the object to data frame
human_std <- as.data.frame(human_std)

Then perform the PCA (principal component analysis) for the standardised data.

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

print(pca_human)
## Standard deviations (1, .., p=8):
## [1] 2.1359720 1.1170602 0.8767066 0.7202736 0.5530904 0.5299143 0.4493522
## [8] 0.3372776
## 
## Rotation (n x k) = (8 x 8):
##                      PC1         PC2          PC3         PC4         PC5
## edu2F        -0.39874523 0.101737848 -0.210481695 -0.31033900  0.43437742
## labF          0.14439668 0.682158952 -0.575062704 -0.27300332 -0.22413576
## lifeExp      -0.43080653 0.003997077  0.073302641 -0.02783772  0.04920168
## expSchool    -0.41858363 0.138149663 -0.073869337 -0.07719277  0.30831448
## gni          -0.33914119 0.098797891 -0.338769060  0.84020987 -0.01249472
## mamo.ratio    0.41991628 0.123208094 -0.145471957  0.28520785  0.05493343
## adobirthrate  0.40271826 0.088902996 -0.003213286  0.11830579  0.81248359
## parlrep      -0.07626861 0.687286162  0.691544283  0.14537131 -0.01724555
##                      PC6         PC7         PC8
## edu2F        -0.50478735  0.48033587  0.12578655
## labF          0.23657515  0.01076780 -0.04754207
## lifeExp       0.57970641  0.04122211  0.68415054
## expSchool    -0.11017948 -0.81713841 -0.13919229
## gni           0.01564322  0.16310711 -0.16583233
## mamo.ratio   -0.43780233 -0.24286926  0.67254029
## adobirthrate  0.36928738  0.07464930 -0.11761127
## parlrep      -0.11284562  0.09265604 -0.02894539
# draw a biplot of the principal component representation and the original variables

s <- summary(pca_human)
print(s)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5    PC6
## Standard deviation     2.1360 1.1171 0.87671 0.72027 0.55309 0.5299
## Proportion of Variance 0.5703 0.1560 0.09608 0.06485 0.03824 0.0351
## Cumulative Proportion  0.5703 0.7263 0.82235 0.88720 0.92544 0.9605
##                            PC7     PC8
## Standard deviation     0.44935 0.33728
## Proportion of Variance 0.02524 0.01422
## Cumulative Proportion  0.98578 1.00000
# rounded percetanges of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)

Let's check out the percentages of variance captured by the principal components and make a biplot.

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 57.0 15.6  9.6  6.5  3.8  3.5  2.5  1.4
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot2<-biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

Figure A: Women’s participation in society and childbearing

The results are now different because PCA is sensitive to the relative scaling of the original features, so standardising the data is a good idea.
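As a check, the rounded variance percentages can be reproduced directly from the standard deviations printed by prcomp above:

```r
# standard deviations of the components, copied from the output above
sdev <- c(2.136, 1.117, 0.877, 0.720, 0.553, 0.530, 0.449, 0.337)

# each component's variance (sdev^2) as a share of the total variance
round(100 * sdev^2 / sum(sdev^2), 1)
# PC1 57.0, PC2 15.6, PC3 9.6, PC4 6.5, PC5 3.8, PC6 3.5, PC7 2.5, PC8 1.4
```

With eight standardised variables the total variance is 8, so the proportions also equal sdev^2 / 8.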

4 The personal interpretation of the two PCAs

My interpretation of PC#1 is that it describes the relation between education and birth rate / maternal mortality: when families have fewer children who live longer, education is seen as more and more important. General education is also likely to improve women's survival of childbirth.

My interpretation of PC#2 is that it describes women's participation in society, for example representation in parliament or participation in the labour market. The principal component seems to be an aggregate of our two variables parlrep and labF, and no variable seems to have the opposite effect. I think this is because these variables are shares combined from several variables in the data wrangling section.

For example, in the top left corner we have the Scandinavian countries with high schooling and high participation ratios for women, and at the bottom we find countries where women are known to be underrepresented in state politics, like Iran, Iraq and Pakistan. These countries are still relatively advanced, as we can see from the maternal mortality ratio. Some African countries on the right have quite high participation ratios for women, but their maternal mortality ratio is still quite high.

5 Tea drinkers of the world unite

For the tea data, 300 tea consumers were interviewed about their consumption of tea. The questions covered how they consume tea, what they think of tea, and background information (sex, age, socio-professional category and sport practice). 122 males and 178 females participated, aged 15 to 90, with a median age of 32 years and a mean of 37. We have categorical, qualitative and quantitative variables in the data; for age, the dataset has two different variables, one continuous and one categorical. The data is part of the FactoMineR package for R, and more information can be found on the FactoMineR homepage.

library(FactoMineR)
data(tea)
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "age_Q")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                                                                     
##                   where       age_Q   
##  chain store         :192   15-24:92  
##  chain store+tea shop: 78   25-34:69  
##  tea shop            : 30   35-44:40  
##                             45-59:61  
##                             +60  :38
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ age_Q: Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...

We have selected columns Tea, How, how, sugar, where and age_Q from the data.

After this selection, the dataset tea_time has 300 observations of 6 variables, all of type factor.

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

From the bar plot we can read the age distribution by category. Note that the first age group in the graph is +60 (it would probably come last if it were named "60+"). Most people use tea bags, and users of unpackaged tea are a minority. Tea is likely bought in bags from a chain store, as the proportions are similar. Tea is usually drunk on its own, and sugar is the most common addition; around half of the drinkers do not use sugar at all. Earl Grey is the most popular variety, consumed about 2.5 times as often as regular black tea.
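If we wanted +60 to appear last, the factor levels could be reordered explicitly. A sketch (age_Q here is a hypothetical stand-in for tea_time$age_Q):

```r
# reorder the age groups so "+60" sorts last instead of first
age_Q <- factor(c("15-24", "+60", "35-44"),
                levels = c("15-24", "25-34", "35-44", "45-59", "+60"))
levels(age_Q)
```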

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.313   0.266   0.249   0.205   0.184   0.175
## % of var.             13.434  11.404  10.689   8.791   7.883   7.511
## Cumulative % of var.  13.434  24.838  35.527  44.318  52.201  59.712
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.168   0.147   0.141   0.135   0.119   0.101
## % of var.              7.218   6.302   6.051   5.787   5.086   4.348
## Cumulative % of var.  66.930  73.231  79.282  85.069  90.155  94.503
##                       Dim.13  Dim.14
## Variance               0.072   0.056
## % of var.              3.094   2.403
## Cumulative % of var.  97.597 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.068  0.005  0.002 |  0.048  0.003  0.001 | -0.471
## 2                  |  0.142  0.021  0.009 |  0.089  0.010  0.004 | -1.069
## 3                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 4                  | -0.798  0.678  0.665 | -0.240  0.072  0.060 |  0.064
## 5                  | -0.199  0.042  0.033 | -0.254  0.081  0.053 | -0.600
## 6                  | -0.582  0.360  0.361 | -0.167  0.035  0.030 | -0.235
## 7                  | -0.190  0.039  0.022 | -0.036  0.002  0.001 | -0.411
## 8                  |  0.150  0.024  0.009 |  0.307  0.118  0.036 | -0.880
## 9                  |  0.268  0.076  0.026 |  0.900  1.016  0.290 |  0.175
## 10                 |  0.605  0.390  0.137 |  0.870  0.948  0.282 | -0.073
##                       ctr   cos2  
## 1                   0.296  0.106 |
## 2                   1.527  0.527 |
## 3                   0.481  0.297 |
## 4                   0.005  0.004 |
## 5                   0.481  0.297 |
## 6                   0.074  0.059 |
## 7                   0.226  0.103 |
## 8                   1.035  0.298 |
## 9                   0.041  0.011 |
## 10                  0.007  0.002 |
## 
## Categories (the 10 first)
##                       Dim.1    ctr   cos2 v.test    Dim.2    ctr   cos2
## black              |  0.750  7.377  0.184  7.421 |  0.467  3.365  0.071
## Earl Grey          | -0.389  5.179  0.273 -9.036 | -0.016  0.010  0.000
## green              |  0.594  2.063  0.044  3.610 | -0.954  6.272  0.113
## alone              | -0.096  0.316  0.017 -2.253 | -0.262  2.803  0.128
## lemon              |  0.483  1.366  0.029  2.938 |  0.202  0.282  0.005
## milk               | -0.091  0.093  0.002 -0.813 |  0.315  1.307  0.026
## other              |  0.938  1.404  0.027  2.853 |  2.736 14.071  0.232
## tea bag            | -0.460  6.376  0.277 -9.096 | -0.154  0.842  0.031
## tea bag+unpackaged |  0.174  0.504  0.014  2.031 |  0.719 10.150  0.236
## unpackaged         |  1.718 18.837  0.403 10.972 | -1.150  9.948  0.180
##                    v.test    Dim.3    ctr   cos2 v.test  
## black               4.618 | -0.740  9.026  0.179 -7.322 |
## Earl Grey          -0.367 |  0.334  4.788  0.201  7.751 |
## green              -5.800 | -0.293  0.629  0.011 -1.778 |
## alone              -6.183 | -0.047  0.098  0.004 -1.118 |
## lemon               1.229 |  1.264 11.746  0.197  7.685 |
## milk                2.811 | -0.379  2.011  0.038 -3.375 |
## other               8.321 | -0.957  1.836  0.028 -2.910 |
## tea bag            -3.046 | -0.436  7.208  0.249 -8.627 |
## tea bag+unpackaged  8.400 |  0.663  9.193  0.200  7.740 |
## unpackaged         -7.346 |  0.330  0.873  0.015  2.107 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.275 0.154 0.216 |
## How                | 0.060 0.295 0.235 |
## how                | 0.484 0.334 0.259 |
## sugar              | 0.132 0.013 0.200 |
## where              | 0.528 0.632 0.233 |
## age_Q              | 0.402 0.169 0.354 |

From the summary we can see that for the category milk the test value is smaller in absolute value than the critical value of 1.96, so we cannot say that its coordinate is significantly different from zero with respect to Dim 1.

With respect to Dim 2, the same goes for the categories Earl Grey and lemon.
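For reference, the 1.96 critical value is just the two-sided 5% quantile of the standard normal, so e.g. milk's v.test of -0.813 on Dim 1 falls short of it:

```r
# the familiar 1.96 cutoff
qnorm(0.975)

# milk's v.test on Dim 1 from the summary above: not significant
abs(-0.813) > qnorm(0.975)   # FALSE
```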

# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

The distance between variable categories gives a measure of their similarity. Here Dim 1 explains 13.43% of the variation and Dim 2 explains 11.40%. Neither dimension explains a majority of the variation, and the percentages are quite close to each other.
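The axis percentages come from the MCA eigenvalues: each one's share of the total variance. A sketch using the eigenvalues printed in the summary above:

```r
# eigenvalues of the 14 dimensions, copied from the MCA summary
eig <- c(0.313, 0.266, 0.249, 0.205, 0.184, 0.175, 0.168,
         0.147, 0.141, 0.135, 0.119, 0.101, 0.072, 0.056)

# percentages of variance for the first two dimensions
round(100 * eig[1:2] / sum(eig), 1)   # ~13.4 and ~11.4
```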

After plotting the MCA factor map, we can clearly see different groups formed:

Sugar and milk are added more often to bag tea than to unpackaged tea, and more often to Earl Grey than to black or green tea. Lemon is added mostly to unpackaged tea. “Other” is a curious case, loading strongly on Dim 2.
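If the factor map gets crowded, `plot.MCA` can restrict the display to the categories that contribute most to the plotted dimensions. A sketch using FactoMineR's `selectMod` argument (the cutoff of 10 categories is my own choice):

```r
# show only the 10 categories contributing most to the displayed dimensions,
# keeping the individuals hidden as in the plot above
plot(mca, invisible = "ind", habillage = "quali", selectMod = "contrib 10")
```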

dimdesc(mca)
## $`Dim 1`
## $`Dim 1`$quali
##               R2      p.value
## where 0.52835508 3.402919e-49
## how   0.48367544 2.339016e-43
## age_Q 0.40178129 7.360334e-32
## Tea   0.27495543 1.837208e-21
## sugar 0.13225028 8.232202e-11
## How   0.05978394 3.854850e-04
## 
## $`Dim 1`$category
##                         Estimate      p.value
## unpackaged            0.69474080 3.340834e-35
## tea shop              0.69465343 1.212375e-32
## black                 0.24172246 7.129309e-15
## No.sugar              0.20372089 8.232202e-11
## +60                   0.38974921 2.728687e-09
## green                 0.15432353 2.710546e-04
## 45-59                 0.11640619 5.815735e-04
## lemon                 0.09776924 3.157531e-03
## other                 0.35244199 4.165452e-03
## 35-44                 0.13205315 4.331606e-03
## tea bag+unpackaged   -0.16991252 4.203783e-02
## alone                -0.22634169 2.399737e-02
## chain store+tea shop -0.09539809 6.112076e-06
## sugar                -0.20372089 8.232202e-11
## Earl Grey            -0.39604598 2.004995e-22
## tea bag              -0.52482827 9.453587e-23
## 15-24                -0.60392196 3.365056e-30
## chain store          -0.59925534 2.883891e-33
## 
## 
## $`Dim 2`
## $`Dim 2`$quali
##              R2      p.value
## where 0.6318637 3.566828e-65
## how   0.3343163 5.692320e-27
## How   0.2947635 2.694094e-22
## Tea   0.1540126 1.635070e-11
## age_Q 0.1689345 3.629091e-11
## 
## $`Dim 2`$category
##                         Estimate      p.value
## chain store+tea shop  0.70929060 7.770361e-47
## tea bag+unpackaged    0.47161067 3.582736e-19
## other                 1.02576718 8.526004e-19
## black                 0.32725469 2.712739e-06
## +60                   0.31029257 3.430996e-06
## 35-44                 0.19254061 1.476821e-03
## tea bag               0.02118156 2.197206e-03
## milk                 -0.22316139 4.765518e-03
## 25-34                -0.33024542 1.272642e-07
## green                -0.40562971 2.539273e-09
## chain store          -0.03682077 2.085882e-09
## alone                -0.52113722 1.777027e-10
## unpackaged           -0.49279223 1.415218e-14
## tea shop             -0.67246984 5.949668e-20
## 
## 
## $`Dim 3`
## $`Dim 3`$quali
##              R2      p.value
## age_Q 0.3536548 5.876798e-27
## how   0.2585093 5.136908e-20
## where 0.2330683 7.697792e-18
## How   0.2348039 4.237186e-17
## Tea   0.2161434 1.968505e-16
## sugar 0.2003056 3.497087e-16
## 
## $`Dim 3`$category
##                         Estimate      p.value
## 25-34                 0.45976930 2.116883e-16
## Earl Grey             0.28300167 3.119648e-16
## tea bag+unpackaged    0.23830348 3.465086e-16
## sugar                 0.22363951 3.497087e-16
## lemon                 0.64614264 5.936838e-16
## chain store+tea shop  0.15382682 3.249970e-11
## tea shop              0.18552512 5.921016e-05
## 15-24                 0.16779074 5.659078e-03
## unpackaged            0.07220445 3.485880e-02
## +60                  -0.15408157 7.466100e-03
## other                -0.46305098 3.461660e-03
## milk                 -0.17423380 6.751923e-04
## black                -0.25323911 1.756827e-14
## 45-59                -0.37793254 5.344405e-15
## No.sugar             -0.22363951 3.497087e-16
## chain store          -0.33935194 6.755202e-19
## tea bag              -0.31050793 2.755334e-20

Using dimdesc we can see that where you buy your tea and how it is packaged are the variables most strongly related to Dim 1, followed by age_Q and the tea type.

For Dim 2 the order is somewhat different: where (chain store or tea shop), how (packaged or not) and How (whether lemon, milk or something else is taken with the tea).
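The long category lists above can be trimmed by tightening the significance threshold; `dimdesc` takes a `proba` argument for this:

```r
# keep only associations significant at the 1% level,
# and describe only the first two dimensions
dimdesc(mca, axes = 1:2, proba = 0.01)
```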

I found describing Dim 2 somewhat problematic, as the group seems to fall between the two ends of buying bagged tea from a chain store and buying unpackaged tea from a tea shop, with other being the single most important category. What this “other” that people add to their tea is, I’m not sure; maybe soy milk or something similar that does not fit under the category “milk”? I could not find a description of the variable on the home page, so possibly only the respondents themselves know what it means.

My interpretation is that people scoring high on Dim 1 are those who prefer unpackaged tea bought from a tea shop and drink it black without sugar, sometimes adding a slice of lemon or drinking green tea. For the categories with negative estimates the opposite holds: an individual with those characteristics is less likely to belong to this group.

We also observe that people belonging to this group tend to be older, which is visible in the earlier MCA plot as well.
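The age pattern can be checked directly by colouring the individuals by age group. A sketch, assuming `age_Q` is one of the qualitative variables used in the MCA:

```r
# individuals coloured by age group; variable categories hidden
plot(mca, invisible = "var", habillage = "age_Q")
```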

Let’s try out one extra plotting function of FactoMineR.

plotellipses(mca, keepvar = 1:5)

The confidence ellipses drawn here are quite small and don’t overlap, so the sub-populations are fairly distinct.